Reinforcement Learning-Based Model Predictive Control for Discrete-Time Systems

Publisher: IEEE

Abstract:
This article proposes a novel reinforcement learning-based model predictive control (RLMPC) scheme for discrete-time systems. The scheme integrates model predictive control (MPC) and reinforcement learning (RL) through policy iteration (PI), where MPC is a policy generator and the RL technique is employed to evaluate the policy. Then the obtained value function is taken as the terminal cost of MPC, thus improving the generated policy. The advantage of doing so is that it rules out the need for the offline design paradigm of the terminal cost, the auxiliary controller, and the terminal constraint in traditional MPC. Moreover, RLMPC proposed in this article enables a more flexible choice of prediction horizon due to the elimination of the terminal constraint, which has great potential in reducing the computational burden. We provide a rigorous analysis of the convergence, feasibility, and stability properties of RLMPC. Simulation results show that RLMPC achieves nearly the same performance as traditional MPC in the control of linear systems and exhibits superiority over traditional MPC for nonlinear ones.
Page(s): 3312 - 3324
Date of Publication: 19 May 2023
PubMed ID: 37204957

SECTION I.

Introduction

The development of modern unmanned systems places higher demands on the performance of the controllers. Typically, these systems are constrained, and violation of constraints may lead to unsafe behaviors that can severely affect system operation. Model predictive control (MPC) is capable of providing an optimal solution while handling the constraints explicitly, which has seen significant success in recent decades with wide applications in diverse fields [1], [2], [3], [4], [5], [6].

The successful applications of MPC attracted tremendous academic interest, and soon a rigorous stability-centered MPC theoretical foundation was established. The early stability results on MPC employ a zero-state terminal equality constraint [7], [8]. They were subsequently extended to the use of a terminal inequality constraint by taking a control Lyapunov function as the terminal cost [9], [10], [11], which established the currently well-known framework for stability. Based on this framework, a commonly adopted MPC design paradigm involves finding an appropriate terminal cost and a terminal controller as well as a terminal set that strictly meet specific conditions. While this is relatively easy for linear systems, it is generally a challenging task for nonlinear ones. Even if a qualified terminal cost, a terminal controller, and the corresponding terminal set are found, the conditions for stability and recursive feasibility are usually quite conservative. In practice, however, the theoretically well-established MPC approaches are prohibitively difficult to apply due to the high computational burden. Instead, MPC without terminal cost and terminal constraint enjoys wide popularity. Its closed-loop stability and recursive feasibility can be guaranteed under some conditions [12], [13]. But the absence of terminal cost entails that the residual costs outside the truncated prediction horizon are completely ignored, which makes it difficult to achieve optimal performance under a very limited prediction horizon. Motivated by the analysis above, we aim to develop a learning-based MPC scheme to learn an appropriate terminal cost function without complex offline design procedures while improving the performance of the controller.

In recent years, striking progress of reinforcement learning (RL), such as AlphaGo [14] and Atari games [15], has drawn the attention of the control community. Different from MPC’s offline design, RL optimizes the control policy through online data-based adaptation [16], [17]. Observed state transitions and costs (or rewards) are the only inputs to the RL agent, and no prior knowledge of the system dynamics is needed [18], [19]. A central idea in RL is temporal-difference (TD) learning [20], [21], [22], which estimates the value function directly from raw experience in a bootstrapping way, without waiting for a final outcome. The value function is a prediction of the expected long-term reward at each state. When the optimality is reached, it encodes the global optimal information so that the infinite-horizon optimal control policy can be obtained.

Nevertheless, to approximate the global optimal policy, the RL agent tends to try different policies and learns via trial and error, thereby struggling to provide safety guarantees on the resulting behaviors. This problem becomes prominent when confronted with safety-critical systems, and safety in this context is defined in terms of stability. There have been some achievements centered on safe RL recently [23], [24], [25], but it remains largely an open field.

In view of the powerful data-driven optimization capability of RL, if combined with MPC to optimize the controller design with running data, it can be expected to improve the control performance. At the same time, benefiting from the solid theoretical foundation of MPC, the properties of the behaviors produced by the RL agent would be easier to analyze. This motivates the attempts to integrate the best of both worlds. However, only limited pioneering work has been reported in this field. For example, in [26], a novel “plan online and learn offline” framework was proposed by taking MPC as the trajectory optimizer of RL. It is shown that MPC enables a more efficient exploration, thereby accelerating the convergence and reducing the approximation errors in the value function. The related work was extended to the framework of actor-critic (AC) in [27] to handle the case of sparse and binary reward, which shows that MPC is an important sampler that effectively propagates global information. These results were further applied to the tasks defined by simple and easily interpretable high-level objectives on a real unmanned ground vehicle in [28], and the expression of the value function was improved. Yet, the focus of these works is on the improvements that MPC brings to RL practice, with no reference to stability. The stability issue was highlighted in [29] where an economic MPC was used as a function approximator in RL. This scheme was further analyzed in [30] under an improved algorithmic framework. These two papers paved the way for [31], where the constraint satisfaction concern was addressed with the help of robust linear MPC. The practicality of these approaches, however, remains to be investigated and verified. To the best of the authors’ knowledge, few studies can balance the theoretical and practical guarantees of the “RL + MPC” approaches, that is, to provide a theoretically safe and operationally efficient scheme, and this article aims to fill this gap.

We propose an “RL + MPC” scheme from a new perspective: policy iteration (PI). Considering that MPC’s ability to handle constraints can provide constraint-enforcing policies for RL agents, safety referred to in this article is not only limited to stability, but also includes constraint satisfaction. The main contributions of this article are threefold.

  1. We provide a new idea for combining RL and MPC by bridging them through PI. In this way, a complete RL-based model predictive control (RLMPC) scheme is developed for discrete-time systems. In each iteration, the current policy is evaluated through learning to obtain the value function. Then it is employed as the terminal cost to compensate for the suboptimality induced by the truncated prediction horizon of MPC, thus improving the policy. This solves the challenge of complex offline design procedures while progressively improving the performance to (near-)optimum.

  2. The convergence of learning, recursive feasibility, and closed-loop stability of the proposed RLMPC are closely investigated, thereby theoretically guaranteeing its safety. We demonstrate that no constraint is violated even before the optimal value function is well-learned, which verifies the ability of MPC to effectively constrain the RL-produced policies within safe limits.

  3. We incorporate the value function approximation (VFA) technique into the developed RLMPC approach to approximate the global optimal value function with high computational efficiency. The effect of the prediction horizon on the control performance is scrutinized. Results show that the proposed RLMPC scheme can achieve nearly the same optimal performance as traditional MPC in linear system control and outperform it in nonlinear cases, which is due to the conservativeness of the offline MPC design for nonlinear systems. In addition, thanks to the elimination of the terminal controller and terminal constraint, the RLMPC scheme is particularly useful in dealing with systems where it is difficult or even impossible to design a terminal controller, such as nonholonomic systems, while reducing computational burden by flexibly adjusting its prediction horizon.

The rest of this article is organized as follows. In Section II, we formally outline the problem formulation. The RLMPC scheme is developed in Section III. Its convergence, optimality, feasibility, and stability properties are analyzed in Section IV. Section V discusses the implementation of the overall scheme. We test the proposed scheme in simulations of both linear and nonlinear examples in Section VI. Section VII concludes the article.

Notation: We use the following convention: $\mathbb{R}$ denotes the set of reals and $\mathbb{R}^n$ is the $n$-dimensional real vector space. $\mathbb{N}$ is the set of natural numbers. For some $r_1, r_2 \in \mathbb{R}$, we use $\mathbb{R}_{\geq r_1}$, $\mathbb{N}_{>r_2}$, and $\mathbb{N}_{[r_1, r_2]}$ to represent the sets $\{r \in \mathbb{R} \mid r \geq r_1\}$, $\{r \in \mathbb{N} \mid r > r_2\}$, and $\{r \in \mathbb{N} \mid r_1 \leq r \leq r_2\}$, respectively. We label the variables of the optimal solution with $*$, feasible solutions with $\tilde{\cdot}$, and estimated ones with $\hat{\cdot}$, respectively. Moreover, the notations $x_{i|k}$ and $u_{i|k}$ indicate the state and input predictions $i$ steps ahead of the current time $k$, respectively. The sequence $\{u_{0|k}, u_{1|k}, \ldots, u_{N-1|k}\}$ is denoted by $\mathbf{u}_k$, or by $\mathbf{u}_k^N$ if we want to emphasize its length $N$.

SECTION II.

Problem Formulation

A. Optimal Control Problem

Consider a dynamic system described by the following state space difference equation:

$$x_{k+1} = f(x_k, u_k) \tag{1}$$

where $x_k \in \mathbb{R}^n$ and $u_k \in \mathbb{R}^m$ are the system state and control input at time $k \in \mathbb{N}$, respectively. It is assumed that the system is subject to the constraints

$$x_k \in \mathcal{X}, \quad u_k \in \mathcal{U} \tag{2}$$

where $\mathcal{X}$ and $\mathcal{U}$ are compact sets containing the origin as an interior point. Also, system (1) is assumed to satisfy the following conditions.

Assumption 1:

The function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is continuous with $f(0,0) = 0$. Under the constraints (2), the system (1) is stabilizable with respect to the equilibrium at $x_k = 0$, $u_k = 0$, and the system state $x_k$ is measurable.

Remark 1:

Assumption 1 is a nonintrusive and common assumption in the MPC community and can be found in numerous works (e.g., [9], [10], [11], [12], [13]). For a system $\breve{x}_{k+1} = f(\breve{x}_k, \breve{u}_k)$ with a general equilibrium $(x_e, u_e)$, one can always make the variable changes $x_k = \breve{x}_k - x_e$ and $u_k = \breve{u}_k - u_e$ to translate it to system (1) with the equilibrium $(0,0)$.
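As a quick numerical illustration of this change of variables (the scalar system and equilibrium below are hypothetical, chosen only for the example):

```python
# Hypothetical scalar system: x_breve_{k+1} = 0.5 * x_breve_k + u_breve_k,
# with a nonzero equilibrium (x_e, u_e) = (2, 1) satisfying x_e = 0.5*x_e + u_e.
x_e, u_e = 2.0, 1.0

def f_breve(x_breve, u_breve):
    return 0.5 * x_breve + u_breve

# Shifted coordinates x_k = x_breve_k - x_e and u_k = u_breve_k - u_e
# give a system of the form (1) whose equilibrium sits at the origin.
def f(x, u):
    return f_breve(x + x_e, u + u_e) - x_e
```

The shifted map $f$ then satisfies $f(0,0) = 0$, as required by Assumption 1.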

For the particular setup considered above, the infinite-horizon optimal control problem at time $k$ with the initial state $x_k \in \mathcal{X}$ is then described by Problem 1.

Problem 1:

Find a control policy $u_k = \pi(x_k)$, defined as a map from the state space to the control space $\pi: \mathbb{R}^n \to \mathbb{R}^m$ and resulting in a control sequence $\mathbf{u}_\infty = \{u_k, u_{k+1}, \ldots\}$, such that it stabilizes the zero equilibrium of (1) and minimizes the infinite-horizon cost

$$J_\infty(x_k, \mathbf{u}_\infty) = \sum_{i=k}^{\infty} \ell(x_i, u_i) \tag{3}$$

subject to constraints (1) and (2), with the running cost $\ell: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}_{\geq 0}$ satisfying

$$\ell(0,0) = 0, \quad \alpha_{K_1}(\|x\|) \leq \check{\ell}(x) := \inf_{u \in \mathcal{U}} \ell(x, u) \leq \alpha_{K_2}(\|x\|) \tag{4}$$

for $x \in \mathcal{X}$, where $\alpha_{K_1}$ and $\alpha_{K_2}$ are $\mathcal{K}$ functions [32].

Remark 2:

The running cost defined by (4) is a fairly general form and is common in many MPC research studies (e.g., [13], [33]). Its specific form is task-dependent; for example, it usually takes a quadratic form in regulation problems: $\ell(x, u) = x^T Q x + u^T R u$, where $Q$ and $R$ are positive-definite matrices with proper dimensions.
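For instance, a regulation-type quadratic running cost satisfying (4) can be sketched as follows (the weights $Q$ and $R$ below are illustrative choices, not taken from the article):

```python
import numpy as np

Q = np.diag([1.0, 2.0])   # positive-definite state weight (illustrative)
R = np.diag([0.1])        # positive-definite input weight (illustrative)

def running_cost(x, u):
    """Quadratic stage cost l(x, u) = x'Qx + u'Ru; zero only at the origin."""
    return float(x @ Q @ x + u @ R @ u)
```

With positive-definite $Q$ and $R$, this cost vanishes only at $(0,0)$ and is bounded below and above by $\mathcal{K}$ functions of $\|x\|$ on any compact set, as (4) requires.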

B. Background of Model Predictive Control

Generally, there is no analytic solution to Problem 1, especially for constrained nonlinear systems. An efficient method to solve this problem is MPC [9], which provides an approximate solution by recursively solving the following OP 1 at each $x_k \in \mathcal{X}$.

OP 1: Find the optimal control sequence $\mathbf{u}_k^*$ by solving the finite-horizon minimization problem

$$\begin{aligned}
\min_{\mathbf{u}_k} \; & J(x_k, \mathbf{u}_k) = \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + F(x_{N|k}) && \text{(5a)} \\
\text{s.t.} \; & x_{0|k} = x_k && \text{(5b)} \\
& x_{i+1|k} = f(x_{i|k}, u_{i|k}), \quad 0 \leq i \leq N-1 && \text{(5c)} \\
& u_{i|k} \in \mathcal{U}, \quad 0 \leq i \leq N-1 && \text{(5d)} \\
& x_{i|k} \in \mathcal{X}, \quad 1 \leq i \leq N-1 && \text{(5e)} \\
& x_{N|k} \in \mathcal{X}_\Omega && \text{(5f)}
\end{aligned}$$

where $N \in \mathbb{N}_{>0}$ is the prediction horizon, $F(\cdot)$ is the terminal cost, $\mathbf{u}_k = \{u_{0|k}, u_{1|k}, \ldots, u_{N-1|k}\}$ is the predicted control sequence, $\mathcal{X}_\Omega$ is the terminal set, and (5f) is the terminal constraint.

After solving OP 1, apply only the first element of $\mathbf{u}_k^*$, that is, $u_k = u^*_{0|k}$, to system (1) and repeat this procedure.

To guarantee recursive feasibility and stability, it is required that $N$ is chosen properly so that $x_{N|k}$ is guaranteed to enter a positively invariant set $\mathcal{X}_\Omega$ (under a terminal controller $\kappa(x_k) \in \mathcal{U}$) containing the origin. Moreover, the parameters of the running and terminal costs should be designed to satisfy $F(x_{N|k}) \geq \sum_{i=N}^{\infty} \ell(x_{i|k}, \kappa(x_{i|k}))$ for $x_k \in \mathcal{X}_\Omega$, such that $F(\cdot)$ is a local Lyapunov function for $\kappa(x_k)$ in $\mathcal{X}_\Omega$.

As mentioned in Section I, it is generally a difficult task to design such terminal conditions (i.e., terminal cost, terminal constraint, and terminal controller) for nonlinear systems. Although MPC without terminal condition (MPCWTC) has been proposed as an option to circumvent this challenge, its performance under a very limited (but necessary) prediction horizon tends to be inferior to that of MPC with a terminal cost. Therefore, we propose to introduce RL into MPC to learn the terminal cost online, thereby generating a policy that is closer to the optimal one in the sense of an infinite horizon.

C. Background of RL

An important class of RL techniques aims to learn the optimal policy $\pi^*(x)$ by iteratively obtaining the evaluation of the current policy and then improving the policy accordingly [34]. At state $x_k$, the value function $V^\pi: \mathbb{R}^n \to \mathbb{R}$, or the evaluation, for a given policy $u_k = \pi(x_k)$ is the accumulated rewards (or running costs) incurred by following the policy from $x_k$, that is, $V^\pi(x_k) = \sum_{i=k}^{\infty} \ell(x_i, u_i)$, which can be rearranged into the well-known Bellman equation

$$V^\pi(x_k) = \ell(x_k, u_k) + V^\pi(x_{k+1}). \tag{6}$$

According to Bellman’s principle, when optimality is achieved, the optimal value function $V^{\pi^*}(x_k)$ satisfies the Bellman optimality equation [35]

$$V^{\pi^*}(x_k) = \min_{u_k} \left\{\ell(x_k, u_k) + V^{\pi^*}(x_{k+1})\right\} \tag{7}$$

and the corresponding optimal policy is given by

$$\pi^*(x_k) = \arg\min_{u_k} \left\{\ell(x_k, u_k) + V^{\pi^*}(x_{k+1})\right\}. \tag{8}$$

Nevertheless, the calculation of $V^{\pi^*}(x_k)$ is generally computationally difficult, and $V^{\pi^*}(x_k)$ is unknown before all the policies $\pi(x_k) = u_k \in \mathcal{U}$ have been considered. This motivates the TD learning approach [36], which provides a more tractable computation. In TD learning, the running costs are observed to construct the TD target $V_{\mathrm{TD}}(x_k)$, thereby iteratively updating the value function as

$$V^\pi(x_k) \leftarrow V^\pi(x_k) + \alpha \left[V_{\mathrm{TD}}(x_k) - V^\pi(x_k)\right] \tag{9}$$

where $\alpha \in (0,1)$ is a constant called the learning step size, and $V_{\mathrm{TD}}(x_k) - V^\pi(x_k)$ is known as the TD error. The Bellman equation (6) holds whenever the TD error is zero, thus yielding the evaluation of policy $\pi(x_k)$. Here, to be consistent with MPC, the TD target can be constructed in the $N$-step form

$$V_{\mathrm{TD}}(x_k) = \sum_{i=k}^{k+N-1} \ell(x_i, u_i) + V^\pi(x_{k+N}). \tag{10}$$

It can be seen that this TD target has a similar form to the MPC cost function (5a), which builds the connection between MPC and RL, and their combination gives rise to the RLMPC algorithm presented in Sections III–V.
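As a minimal sketch of the TD update (9) with the $N$-step target (10) (the function names are ours, and the value function and observed costs are placeholders):

```python
import numpy as np

def n_step_td_target(costs, V, x_k_plus_N):
    """N-step TD target (10): the N observed running costs plus the
    current value estimate at the state N steps ahead."""
    return float(np.sum(costs)) + V(x_k_plus_N)

def td_update(V_xk, td_target, alpha=0.1):
    """TD update (9): move the estimate toward the target by step size alpha."""
    return V_xk + alpha * (td_target - V_xk)
```

When the TD error `td_target - V_xk` reaches zero, the $N$-step Bellman relation holds at $x_k$, which is exactly the policy-evaluation condition used later in Section III-B.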

SECTION III.

RL-Based MPC

In the proposed RLMPC, the value function and control policy are updated in the PI manner. Specifically, we take a currently obtained heuristic term as the terminal cost of MPC, and the corresponding control input is the current control policy. The costs incurred by following this policy, which can be regarded as the rewards for the controller interacting with the system dynamics, are used as training data for learning the value function. The VFA technique is employed to approximate the underlying value function, that is, the evaluation of the current policy. With the latest learned value function taken as the heuristic term in the cost function, the MPC controller, that is, the current policy, is updated. Fig. 1 illustrates the RLMPC pipeline.

Fig. 1. RLMPC pipeline.

A. Policy Generator—MPC

In this section, we formulate the MPC problem to be solved at each step (see OP 2). Since MPC only looks N steps forward at each control period, it ultimately produces a locally optimal policy for Problem 1 unless coupled with a terminal cost that propagates global information [26]. Motivated by this, we take the heuristic term learned through RL, namely the value function, as the terminal cost. Based on the results presented in [13], we modify the formulation of MPC as follows.

OP 2: Find the optimizer $\mathbf{u}_k^*$ of the following problem:

$$\begin{aligned}
\min_{\mathbf{u}_k} \; & J(x_k, \mathbf{u}_k) = \sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi(x_{N|k}) && \text{(11a)} \\
\text{s.t.} \; & x_{0|k} = x_k && \text{(11b)} \\
& x_{i+1|k} = f(x_{i|k}, u_{i|k}), \quad 0 \leq i \leq N-1 && \text{(11c)} \\
& u_{i|k} \in \mathcal{U}, \quad 0 \leq i \leq N-1 && \text{(11d)} \\
& x_{i|k} \in \mathcal{X}, \quad 1 \leq i \leq N && \text{(11e)}
\end{aligned}$$

where the control sequence $\mathbf{u}_k = \{u_{0|k}, u_{1|k}, \ldots, u_{N-1|k}\}$ is the decision variable, and $V^\pi(\cdot)$ is a heuristic term obtained through the learning technique to be presented in Section III-B. We define the MPC control policy generated under a fixed $V^\pi(\cdot)$ to be a fixed policy.

Up to this point, we obtain the policy $\pi(x_k)$ implicitly defined by OP 2: solving OP 2 at time $k$ yields a control sequence $\mathbf{u}_k^* = \{u^*_{0|k}, u^*_{1|k}, \ldots, u^*_{N-1|k}\}$, and its first element is applied to system (1).
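A receding-horizon solve of a problem of this type can be sketched by single shooting over the input sequence. The double-integrator model, weights, input bounds, and the quadratic stand-in for the learned terminal cost $V^\pi$ below are all illustrative assumptions, not the article's setup (state constraints (11e) are omitted for brevity):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative discretized double integrator and quadratic stage cost.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([0.005, 0.1])
Q, r, N = np.eye(2), 0.1, 10

def V_pi(x):
    """Placeholder for the learned heuristic terminal cost."""
    return 10.0 * float(x @ x)

def mpc_policy(xk):
    """Solve a simplified OP 2 by single shooting over the input sequence
    under box input constraints, returning only the first element."""
    def cost(u_seq):
        x, J = xk.copy(), 0.0
        for u in u_seq:                    # accumulate the N stage costs
            J += float(x @ Q @ x) + r * u * u
            x = A @ x + B * u              # predict with the model, cf. (11c)
        return J + V_pi(x)                 # terminal cost V_pi(x_{N|k})
    res = minimize(cost, np.zeros(N), bounds=[(-1.0, 1.0)] * N)
    return res.x[0]                        # u_k = u*_{0|k}, receding horizon
```

Replacing `V_pi` with the latest learned value function and re-solving at every step gives the fixed policy described above.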

Remark 3:

A significant advantage of this formulation lies in the fact that it not only inherits the advantages of MPCWTC [13], but also has the potential to further improve the control performance by compensating, through the learned heuristic term, for the costs neglected by the truncated prediction horizon.

In this article, the initial policy $\pi_0(x_k)$ is generated by solving OP 2 with $V^\pi(\cdot) \equiv 0$ and implemented in the receding horizon fashion, which is exactly the MPCWTC. To ensure its stability and recursive feasibility, we make the following assumptions.

Assumption 2 (Controllability Assumption):

There exists a positive constant $\beta \in \mathbb{R}_{>0}$ such that

$$V^{\pi^*}(x_k) \leq \beta \check{\ell}(x_k) \quad \forall x_k \in \mathcal{X} \tag{12}$$

where $V^{\pi^*}(x_k)$ and $\check{\ell}(x_k)$ are defined in (7) and (4), respectively.

Lemma 1 (See [13]):

If the prediction horizon N of the MPCWTC is chosen to satisfy

$$N > \frac{2\ln\beta}{\ln\beta - \ln(\beta - 1)} \tag{13}$$

it is guaranteed that the closed-loop system is asymptotically stable and the MPCWTC optimization problem is recursively feasible.

Assumption 3:

In this context, the prediction horizon of the initial controller satisfies (13).

Remark 4:

Assumption 2 is a fairly standard condition in the MPCWTC literature (e.g., [13], [37], [38]) and is also known as the boundedness condition of the optimal value function. It indicates the controllability of system (1) with respect to the equilibrium, since otherwise $V^{\pi^*}(x_k)$ would go to infinity [13]. Lemma 1 and Assumption 3 jointly ensure that the initial controller is stabilizing, which is consistent with the requirement on the initial policy for PI (to be introduced in Section III-C). The method to calculate a suitable $\beta$ can be found in [39] and [40].
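Given a suitable $\beta > 1$, the smallest integer horizon satisfying (13) can be computed with a small helper (the function name and usage below are ours, not from the article):

```python
import math

def min_horizon(beta):
    """Smallest integer prediction horizon N satisfying (13):
    N > 2 ln(beta) / (ln(beta) - ln(beta - 1)), assuming beta > 1."""
    bound = 2.0 * math.log(beta) / (math.log(beta) - math.log(beta - 1.0))
    return math.floor(bound) + 1
```

Note that the bound grows as $\beta$ increases, i.e., a less controllable system (larger $\beta$) demands a longer initial prediction horizon.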

B. Learning of the Value Function

Now we focus on how to perform policy evaluation on a fixed control policy $\pi(x_k)$ at time $k$. As mentioned in Section II-C, the essence of policy evaluation is to obtain $V^\pi(x_k)$ for $\pi(x_k)$ on the set $\mathcal{X}$. For a continuous state space, however, it is impossible to obtain $V^\pi(x_k)$ at every $x_k \in \mathcal{X}$, because we cannot traverse all states in the state space. Therefore, the VFA technique [36] is adopted in this article to approximate the underlying true value function.

According to the higher-order Weierstrass approximation theorem [41], there exists a dense polynomial basis set $\{\Phi_i(x_k)\}$ such that one can approximate the true value function $V^\pi(x_k)$, $x_k \in \mathcal{X}$, in the following form:

$$V^\pi(x_k) = \sum_{i=1}^{\infty} W_i \Phi_i(x_k) = \sum_{i=1}^{p} W_i \Phi_i(x_k) + \sum_{i=p+1}^{\infty} W_i \Phi_i(x_k) = W^T \Phi(x_k) + e(x_k) = \hat{V}^\pi(x_k, W) + e(x_k) \tag{14}$$

where $\hat{V}^\pi(x_k, W)$ is an approximation of $V^\pi(x_k)$, $W = [W_1, \ldots, W_p]^T \in \mathbb{R}^p$ is the weight vector, and $\Phi(x_k) = [\Phi_1(x_k), \ldots, \Phi_p(x_k)]^T$ is the basis vector. The approximation error $e(x_k)$ converges uniformly to zero as $p \to \infty$. Note that this approximation is essentially a single-layer network, and all we have to do is to feed training data into this network.

Consider a set containing $q$ ($q \in \mathbb{N}_{>0}$) states sampled in $\mathcal{X}$ at time $k$, denoted as $S_k = \{x_{k_1}, x_{k_2}, \ldots, x_{k_q}\} \subseteq \mathcal{X}$. Taking each of its elements separately as an initial state, we propagate one step using this policy and record the incurred costs as training data. Specifically, for each $x_{k_j} \in S_k$, $j \in \mathbb{N}_{[1,q]}$, solve OP 2 (whose terminal cost is a fixed heuristic function corresponding to the current policy) with the subscript $k$ replaced by $k_j$, obtain the solution $\mathbf{u}^*_{k_j}$, and calculate the cost $J(x_{k_j}, \mathbf{u}^*_{k_j})$ according to (11a). Note that we do not apply any $\mathbf{u}^*_{k_j}$ to the real system in this process, so the new subscript $k_j$ is adopted here to distinguish these sampled states from the actual system state.

With all these costs $J(x_{k_j}, \mathbf{u}^*_{k_j})$, $j \in \mathbb{N}_{[1,q]}$, we are now able to train the network in a manner similar to TD learning. Recall that the purpose of (9) is to approximate the TD target with $V^\pi(x_k)$; we now directly use $\hat{V}^\pi(x_{k_j}, W)$ to approximate the targets $J(x_{k_j}, \mathbf{u}^*_{k_j})$

$$\hat{V}^\pi(x_{k_j}, W) \leftarrow \hat{V}^\pi(x_{k_j}, W) + \alpha \left[J(x_{k_j}, \mathbf{u}^*_{k_j}) - \hat{V}^\pi(x_{k_j}, W)\right] \tag{15}$$

for $j \in \mathbb{N}_{[1,q]}$, and we aim to minimize $E(x_{k_j}) = \frac{1}{2} e(x_{k_j})^2$, where $e(x_{k_j}) = J(x_{k_j}, \mathbf{u}^*_{k_j}) - \hat{V}^\pi(x_{k_j}, W)$. The stochastic gradient descent method [36] achieves this minimization by adjusting the weight vector $W$ with a small step size $\alpha \in (0,1)$ at a time in the direction of the negative gradient (with respect to $W$) of the squared error

$$\begin{aligned}
W &\leftarrow W - \frac{1}{2} \alpha \nabla_W \left[J(x_{k_j}, \mathbf{u}^*_{k_j}) - \hat{V}^\pi(x_{k_j}, W)\right]^2 \\
&= W + \alpha \left[J(x_{k_j}, \mathbf{u}^*_{k_j}) - \hat{V}^\pi(x_{k_j}, W)\right] \nabla_W \hat{V}^\pi(x_{k_j}, W) \\
&= W + \alpha \left[J(x_{k_j}, \mathbf{u}^*_{k_j}) - W^T \Phi(x_{k_j})\right] \Phi(x_{k_j}), \quad j \in \mathbb{N}_{[1,q]}
\end{aligned} \tag{16}$$

in which $\nabla_W$ denotes the vector of partial derivatives with respect to the components of $W$.

After $W$ converges for all the training data, that is, $[J(x_{k_j}, \mathbf{u}^*_{k_j}) - \hat{V}^\pi(x_{k_j}, W)] \to 0$ holds for $x_{k_j} \in S_k$, we obtain the (approximate) evaluation of the given fixed control policy. Provided with sufficient training data and a proper choice of the basis vector, we have $\hat{V}^\pi(x_k, W) \approx V^\pi(x_k)$, $\forall x_k \in \mathcal{X}$. It is worth noting that with VFA, we can characterize the value function over the state space using only the storage space of a $p$-dimensional vector $W$, which enhances the utility of the proposed approach.
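The update (16) can be sketched as follows; the quadratic basis choice, step size, and sweep count are illustrative assumptions, and the targets would in practice be the costs $J(x_{k_j}, \mathbf{u}^*_{k_j})$ from OP 2:

```python
import numpy as np

def phi(x):
    """Quadratic polynomial basis vector Phi(x) for a 2-D state
    (an illustrative choice of p = 3 basis functions)."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2])

def train_vfa(W, samples, targets, alpha=0.1, sweeps=1000):
    """Stochastic-gradient rule (16): W <- W + alpha [J - W'Phi(x)] Phi(x),
    swept repeatedly over the sampled states and their recorded costs."""
    for _ in range(sweeps):
        for x, J in zip(samples, targets):
            W = W + alpha * (J - W @ phi(x)) * phi(x)
    return W
```

Because the targets are scalars and the approximator is linear in $W$, each update is a rank-one correction along $\Phi(x_{k_j})$, and the weights converge whenever the sampled basis vectors span the weight space and $\alpha$ is small enough.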

C. PI in RLMPC

In RLMPC, the value function and the control policy are updated over iterations with index $t \in \mathbb{N}$ increasing from zero to infinity. For convenience, we denote the heuristic term employed in the $t$th iteration as $V^\pi_t(\cdot)$, and the policy generated by the MPC with $V^\pi_t(\cdot)$ as the terminal cost is denoted as $\pi_t(\cdot)$. We are now in a position to present the PI mechanism in RLMPC.

  1. Initialization: Given an initial state $x_k \in \mathcal{X}$ at time $k$, we start the iteration with an initial heuristic term $V^\pi_0(\cdot) \equiv 0$ and a corresponding stabilizing policy $\pi_0(x_k)$ whose existence is already guaranteed by Assumptions 2 and 3. For $t = 0, 1, 2, \ldots$, perform the following two steps iteratively.

  2. Policy Evaluation Step: Determine the value function $V^\pi_{t+1}(\cdot)$ for policy $\pi_t(\cdot)$ by following the steps presented in Section III-B: first, sample in $\mathcal{X}$ to form $S_k$. Second, generate training data $J_t(x_{k_j}, \mathbf{u}^t_{k_j})$, $x_{k_j} \in S_k$, with

     $$J_t(x_{k_j}, \mathbf{u}^t_{k_j}) = \min_{\mathbf{u}_{k_j} \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k_j}, u_{i|k_j}) + V^\pi_t(x_{N|k_j})\right\} = \sum_{i=0}^{N-1} \ell(x^t_{i|k_j}, u^t_{i|k_j}) + V^\pi_t(x^t_{N|k_j}) \tag{17}$$

     where $x_{k_j} = x^t_{0|k_j}$, and $x^t_{i|k_j}$, $u^t_{i|k_j}$, $i \in \mathbb{N}_{[0,N-1]}$, are deduced from model (11c), satisfying (11d) and (11e). Third, train the network using (16) to obtain the value function

     $$V^\pi_{t+1}(x_k) = J_t(x_k, \mathbf{u}^t_k) \quad \forall x_k \in \mathcal{X}. \tag{18}$$

  3. Policy Improvement Step: According to the descriptions in Section III-A, the updated policy $\pi_{t+1}(x_k)$ is yielded by solving OP 2 with terminal cost $V^\pi_{t+1}(\cdot)$

     $$\mathbf{u}^{t+1}_k = \arg\min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi_{t+1}(x_{N|k})\right\} \tag{19}$$

     and $\mathbf{u}^{t+1}_k$ is implemented in a receding horizon manner.
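To see the iteration in closed form, consider an unconstrained scalar linear system with quadratic running cost, $N = 1$, and a quadratic value-function parameterization $V^\pi_t(x) = P_t x^2$. Under these illustrative assumptions (not the article's examples), the evaluation and improvement steps collapse to a scalar Riccati-type recursion on $P_t$:

```python
# Illustrative scalar case: x_{k+1} = a x_k + b u_k, running cost
# q x^2 + r u^2, horizon N = 1, terminal cost V_t(x) = P_t x^2.
a, b, q, r = 1.2, 1.0, 1.0, 0.5

def rlmpc_iteration(P):
    """One PI round: minimizing q x^2 + r u^2 + P (a x + b u)^2 over u
    in closed form gives J_t(x) = P' x^2, the next coefficient P'."""
    return q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)

P_hist = [0.0]                 # V_0 = 0 (initialization step 1)
for _ in range(50):            # alternate evaluation and improvement
    P_hist.append(rlmpc_iteration(P_hist[-1]))
```

Consistent with the analysis of Section IV, the sequence $P_t$ is monotonically nondecreasing and converges to the fixed point of the recursion, i.e., the infinite-horizon optimal value-function coefficient.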

In what follows, we focus on analyzing the theoretical properties of this RLMPC scheme.

SECTION IV.

Convergence, Feasibility, and Stability Analysis

In this section, we analyze the convergence, feasibility, and stability properties. We show that for $x_k \in \mathcal{X}$, the value function $V^\pi_t(x_k) \to V^{\pi^*}(x_k)$ and the control policy $\pi_t(x_k) \to \pi^*(x_k)$ as the iteration index $t \to \infty$. Also, we demonstrate that the policy obtained in each iteration is a feasible and stabilizing policy for the closed-loop system. The convergence proof is inspired by [42], which demonstrates the convergence of optimal control (unconstrained infinite-horizon optimization) under PI. This article leverages its proof idea to extend the results to RLMPC. Before proceeding, we introduce the following definition.

Definition 1:

For the RLMPC optimization problem OP 2, a control sequence $\tilde{\mathbf{u}}_k = \{\tilde{u}_{0|k}, \ldots, \tilde{u}_{N-1|k}\}$ is said to be feasible if $\tilde{u}_{i|k} \in \mathcal{U}$, $\forall i \in \mathbb{N}_{[0,N-1]}$, and the resulting system trajectory $\tilde{\mathbf{x}}_k = \{\tilde{x}_{0|k}, \ldots, \tilde{x}_{N|k}\}$ satisfies $\tilde{x}_{i|k} \in \mathcal{X}$, $\forall i \in \mathbb{N}_{[0,N]}$, rendering the cost function $J(\tilde{x}_k, \tilde{\mathbf{u}}_k)$ in the form of (11a) finite.

A. Convergence Analysis

We first present two theorems demonstrating, respectively, that the value function $V^\pi_t(x_k)$ is uniformly monotonically nondecreasing with respect to the iteration index $t$ and that it has an upper bound. Then, based on these two theorems, it is shown in a third theorem that $V^\pi_t(x_k)$ converges to the optimal solution as $t \to \infty$.

Theorem 1:

For $t \in \mathbb{N}$ and $x_k \in \mathcal{X}$, let $V^\pi_t(x_k)$ and $\pi_t(x_k)$ be obtained by the PI presented in Section III-C; then $V^\pi_t(x_k)$ is a uniformly monotonically nondecreasing sequence in $t$, that is, $\forall t: V^\pi_t(x_k) \leq V^\pi_{t+1}(x_k)$.

Proof:

Define a new value function $\Gamma_t(x_k)$ with $\Gamma_0(\cdot) \equiv 0$, which follows the update law

$$\Gamma_t(x_k) = \sum_{i=0}^{N-1} \ell(\tilde{x}_{i|k}, \tilde{u}_{i|k}) + \Gamma_{t-1}(\tilde{x}_{N|k}) \tag{20}$$

where $\tilde{\mathbf{u}}_k = \{\tilde{u}_{0|k}, \ldots, \tilde{u}_{N-1|k}\}$ is an arbitrary feasible control sequence as defined in Definition 1, $\tilde{x}_{0|k} = x_k$, and $\tilde{x}_{i|k}$, $i \in \mathbb{N}_{[1,N]}$, is the resulting trajectory of $\tilde{\mathbf{u}}_k$. Since $\Gamma_0(\cdot) = V^\pi_0(\cdot) = 0$, we have

$$\begin{aligned}
\Gamma_1(x_k) &= \sum_{i=0}^{N-1} \ell(\tilde{x}_{i|k}, \tilde{u}_{i|k}) + \Gamma_0(\tilde{x}_{N|k}) \\
&\geq \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \Gamma_0(x_{N|k})\right\} \\
&= \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi_0(x_{N|k})\right\} \\
&= \sum_{i=0}^{N-1} \ell(x^0_{i|k}, u^0_{i|k}) + V^\pi_0(x^0_{N|k}) = V^\pi_1(x_k). \tag{21}
\end{aligned}$$

The last two equalities hold from (17) and (18), indicating the following update law:

$$V^\pi_{t+1}(x_k) = J_t(x_k, \mathbf{u}^t_k) = \sum_{i=0}^{N-1} \ell(x^t_{i|k}, u^t_{i|k}) + V^\pi_t(x^t_{N|k}) = \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi_t(x_{N|k})\right\}. \tag{22}$$

Assume that at $t = l$, $\Gamma_l(x_k) \geq V^\pi_l(x_k)$ holds for $x_k \in \mathcal{X}$; then, for $t = l+1$,

$$\begin{aligned}
\Gamma_{l+1}(x_k) &= \sum_{i=0}^{N-1} \ell(\tilde{x}_{i|k}, \tilde{u}_{i|k}) + \Gamma_l(\tilde{x}_{N|k}) \\
&\geq \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \Gamma_l(x_{N|k})\right\} \\
&= \sum_{i=0}^{N-1} \ell(\bar{x}^l_{i|k}, \bar{u}^l_{i|k}) + \Gamma_l(\bar{x}^l_{N|k}) \\
&\geq \sum_{i=0}^{N-1} \ell(\bar{x}^l_{i|k}, \bar{u}^l_{i|k}) + V^\pi_l(\bar{x}^l_{N|k}) \\
&\geq \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi_l(x_{N|k})\right\} \\
&= \sum_{i=0}^{N-1} \ell(x^l_{i|k}, u^l_{i|k}) + V^\pi_l(x^l_{N|k}) = V^\pi_{l+1}(x_k)
\end{aligned}$$

where $\bar{u}^l_{i|k}$, $\bar{x}^l_{i|k}$, $i = 0, 1, \ldots$, are the solutions to the problem $\min_{\mathbf{u}_k \in \mathcal{U}} \{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \Gamma_l(x_{N|k})\}$. Through induction, we conclude that

$$\forall t: \Gamma_t(x_k) \geq V^\pi_t(x_k). \tag{23}$$

Next, we use this conclusion to prove that $V^\pi_t(x_k)$ is a uniformly monotonically nondecreasing sequence in $t$. Again, mathematical induction is adopted. First, since $V^\pi_1(x_k) = \sum_{i=0}^{N-1} \ell(x^0_{i|k}, u^0_{i|k}) \geq 0$ and $\Gamma_0(x_k) = 0$, it is obvious that $\Gamma_0(x_k) \leq V^\pi_1(x_k)$. Second, if we let the arbitrary feasible sequence $\tilde{\mathbf{u}}_k$ be $\mathbf{u}^t_k$ in (20), we have the following update law at $t = l$:

$$\Gamma_l(x_k) = \sum_{i=0}^{N-1} \ell(x^l_{i|k}, u^l_{i|k}) + \Gamma_{l-1}(x^l_{N|k}). \tag{24}$$

At this time, assume that $\Gamma_{l-1}(x_k) \leq V^\pi_l(x_k)$ holds for $x_k \in \mathcal{X}$; then, at $t = l+1$, from (24) and (22), we derive

$$\Gamma_l(x_k) - V^\pi_{l+1}(x_k) = \Gamma_{l-1}(x^l_{N|k}) - V^\pi_l(x^l_{N|k}) \leq 0. \tag{25}$$

Hence, $\Gamma_l(x_k) \leq V^\pi_{l+1}(x_k)$ holds. Therefore, we conclude that $\forall t: \Gamma_t(x_k) \leq V^\pi_{t+1}(x_k)$. Combining this with (23), we can draw the conclusion that

$$\forall t: V^\pi_t(x_k) \leq \Gamma_t(x_k) \leq V^\pi_{t+1}(x_k) \tag{26}$$

which completes the proof.

Remark 5:

Since $V^\pi_0(\cdot) = 0$ and the running cost (4) is positive-definite, Theorem 1 indicates that the value function $V^\pi_t(x_k)$ obtained in any iteration $t \in \mathbb{N}_{\geq 1}$ is positive-definite.

Theorem 2:

For $t \in \mathbb{N}$ and $x_k \in \mathcal{X}$, let $V^\pi_t(x_k)$ and $\pi_t(x_k)$ be obtained by the PI presented in Section III-C, and let $V^{\pi^*}(x_k)$ be the optimal value function satisfying the Bellman optimality equation

$$V^{\pi^*}(x_k) = \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^{\pi^*}(x_{N|k})\right\} \tag{27}$$

then $V^{\pi^*}(x_k)$ is an upper bound of $V^\pi_t(x_k)$, that is,

$$\forall t: V^\pi_t(x_k) \leq V^{\pi^*}(x_k). \tag{28}$$

Proof:

We prove this by induction. First, at $t = 0$, $V^\pi_0(x_k) = 0 \leq V^{\pi^*}(x_k)$. Second, we assume that at $t = l$, $V^\pi_l(x_k) \leq V^{\pi^*}(x_k)$ holds for $x_k \in \mathcal{X}$; then, at $t = l+1$, we have

$$\begin{aligned}
V^\pi_{l+1}(x_k) &= \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi_l(x_{N|k})\right\} \\
&\leq \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^{\pi^*}(x_{N|k})\right\} = V^{\pi^*}(x_k). \tag{29}
\end{aligned}$$

Thus, $V^\pi_t(x_k) \leq V^{\pi^*}(x_k)$ holds for $t \in \mathbb{N}$, and $V^{\pi^*}(x_k)$ is an upper bound. The proof is completed.

Based on the properties revealed by Theorems 1 and 2, we give the following theorem to show the convergence of RLMPC.

Theorem 3:

For $t \in \mathbb{N}$ and $x_k \in \mathcal{X}$, let $V^\pi_t(x_k)$ and $\pi_t(x_k)$ be obtained by the PI presented in Section III-C. As $t \to \infty$, $\lim_{t\to\infty} V^\pi_t(x_k) = V^{\pi_\infty}(x_k) = V^{\pi^*}(x_k)$ and $\lim_{t\to\infty} \pi_t(x_k) = \pi_\infty(x_k) = \pi^*(x_k)$.

Proof:

From Theorems 1 and 2, we know that $V^\pi_t(x_k)$ is a uniformly monotonically nondecreasing and upper-bounded sequence in $t$. According to the monotone convergence theorem [43], $V^\pi_t(x_k)$ converges to a value, denoted by $V^{\pi_\infty}(x_k)$, as $t \to \infty$. Since $\lim_{t\to\infty} V^\pi_t(x_k) = \lim_{t\to\infty} V^\pi_{t+1}(x_k) = V^{\pi_\infty}(x_k)$, we obtain

$$\begin{aligned}
V^{\pi_\infty}(x_k) &= \lim_{t\to\infty} V^\pi_{t+1}(x_k) \\
&= \lim_{t\to\infty} \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^\pi_t(x_{N|k})\right\} \\
&= \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \lim_{t\to\infty} V^\pi_t(x_{N|k})\right\} \\
&= \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + \lim_{t\to\infty} V^\pi_{t+1}(x_{N|k})\right\} \\
&= \min_{\mathbf{u}_k \in \mathcal{U}} \left\{\sum_{i=0}^{N-1} \ell(x_{i|k}, u_{i|k}) + V^{\pi_\infty}(x_{N|k})\right\}
\end{aligned}$$

which is the Bellman optimality equation. Thus, $V^{\pi_\infty}(x_k) = V^{\pi^*}(x_k)$, and the resulting policy satisfies $\lim_{t\to\infty} \pi_t(x_k) = \pi_\infty(x_k) = \pi^*(x_k)$. The proof is completed.

B. Stability and Feasibility Analysis

Theorem 4:

For $t \in \mathbb{N}$ and $x_k \in \mathcal{X}$, let $V^\pi_t(x_k)$ and $\pi_t(x_k)$ be obtained by the PI presented in Section III-C; then, for every $t \in \mathbb{N}$, the RLMPC policy $\pi_t(x_k)$ is feasible and the closed-loop system is asymptotically stable under $\pi_t(x_k)$.

Proof:

For t=0 , Assumption 3 has already ensured the asymptotic stability and feasibility of the initial policy π0(xk) , xkX . Before proceeding, we define

VN~(xk)=minuN~kUi=0N~1(xi|k,ui|k)(30)
View SourceRight-click on figure for MathML and additional features. subject to (11b)–(11e) with N replaced by N~ , where N~ can be an arbitrary positive integer, and uN~k={u0|k,u1|k,,uN~1|k} is the decision variable of length N~ . By letting N~=N , it is obvious that VN(xk)=J0(xk,u0Nk)=V1π(xk) holds for xkX . Furthermore, we can also obtain
$$\begin{aligned}V_{2N}^{*}(x_k)&=\min_{\mathbf{u}_{2N|k}\in U}\sum_{i=0}^{2N-1}\ell(x_{i|k},u_{i|k})\\&=\min_{\mathbf{u}_{2N|k}\in U}\Big\{\sum_{i=0}^{N-1}\ell(x_{i|k},u_{i|k})+\sum_{i=N}^{2N-1}\ell(x_{i|k},u_{i|k})\Big\}\\&=\min_{\mathbf{u}_{N|k}\in U}\Big\{\sum_{i=0}^{N-1}\ell(x_{i|k},u_{i|k})+V_N^{*}(x_{N|k})\Big\}\end{aligned}\tag{31}$$
where the last equality holds by Bellman's principle of optimality. As a result, invoking (22) leads to
$$V_{2N}^{*}(x_k)=\min_{\mathbf{u}_{N|k}\in U}\Big\{\sum_{i=0}^{N-1}\ell(x_{i|k},u_{i|k})+V_1^{\pi}(x_{N|k})\Big\}=J_1^{*}(x_k,\mathbf{u}_{N|k}^{1})=V_2^{\pi}(x_k).\tag{32}$$

This illustrates that the policy generated by RLMPC with $V_1^{\pi}(x_{N|k})$ as the terminal cost is equivalent to that of the MPCWTC with a prediction horizon of length $2N$. By induction, it is easy to verify

$$V_{(t+1)N}^{*}(x_k)=J_t^{*}(x_k,\mathbf{u}_{N|k}^{t})=V_{t+1}^{\pi}(x_k),\quad\forall t\in\mathbb{N}\tag{33}$$
which implies that the policy $\pi_t(x_k)$ is equivalent to the MPCWTC with prediction horizon $(t+1)N$. Given that $N$ satisfies (13), and hence $(t+1)N\ge 2\ln\beta/[\ln\beta-\ln(\beta-1)]$ also holds, the asymptotic stability and feasibility of $\pi_t(x_k)$ are guaranteed by Lemma 1. The proof is completed.
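The horizon-accumulation identity (31)–(33) can be checked numerically in the linear-quadratic case, where one PI iteration corresponds to $N$ backward Riccati steps. This is a stand-in under the assumption of quadratic value functions (the paper's setting is more general): running $N$ backward steps from the previous value function equals running $(t+1)N$ steps from a zero terminal cost, i.e., $\pi_t$ matches the MPCWTC with prediction horizon $(t+1)N$.

```python
import numpy as np

# Linear example from Section VI-A
A = np.array([[1.0, 0.1], [0.5, 0.9]])
B = np.array([[1.0], [0.0]])
Q, R = np.eye(2), np.array([[0.5]])

def backward(P, n):
    # n backward dynamic-programming (Riccati) steps from terminal cost matrix P
    for _ in range(n):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
    return P

N = 3
P0 = np.zeros((2, 2))            # MPCWTC: zero terminal cost
V1 = backward(P0, N)             # V_1^pi, cf. V*_N in (30)
V2 = backward(V1, N)             # V_2^pi: N-step MPC with terminal cost V_1^pi

assert np.allclose(V2, backward(P0, 2 * N))              # (32): V_2^pi = V*_{2N}
assert np.allclose(backward(V2, N), backward(P0, 3 * N)) # (33) with t = 2
```

In the LQ case the identity holds exactly by composition of the backward recursion; the paper's proof extends the same argument to the general constrained nonlinear setting via Bellman's principle.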

Remark 6:

The above proof reveals that the essence of RLMPC is to accumulate prediction horizon on top of the initial controller through iterations. As $t\to\infty$, $\pi_t(x_k)$ converges to the policy generated by the MPCWTC with an infinite prediction horizon, that is, the optimal policy. This property echoes the results of Theorems 1–3. In this sense, (33) serves as a performance indicator for each iteration, and $N$ determines the convergence rate (CR).

Remark 7:

For specific control tasks, the MPCWTC can generally achieve (near-)optimal performance with a finite prediction horizon $\bar N$, without requiring an infinite one. In other words, once the prediction horizon of the MPCWTC is no less than $\bar N$, increasing it further hardly improves the performance. This indicates that we do not have to iterate RLMPC to $t\to\infty$ but can stop at some finite $\bar t$ once $V_{\bar t}^{\pi}(x_k)$ hardly changes for larger $\bar t$. Furthermore, with $V_{\bar t}^{\pi}(x_k)$ as the terminal cost, one can flexibly adjust the prediction horizon of RLMPC while obtaining almost the same control policy $\pi_{\bar t}(x_k)\approx\pi^{*}(x_k)$, since the Bellman optimality equation (7) inherently holds at any prediction horizon. Because the prediction horizon determines the dimensionality, and hence the computational burden, of the optimization problem, this property provides a way to significantly reduce the computational burden by shortening the prediction horizon of RLMPC.

SECTION V.

Implementation Issues on RLMPC

A. On the PI

Theorems 1 and 2 reveal that, as $t$ increases, $V_t^{\pi}(x_k)$ approaches $V^{*}(x_k)$ monotonically. This implies that each iteration necessarily yields some degree of improvement in $V_t^{\pi}(x_k)$. More importantly, Theorem 4 demonstrates that the $V_t^{\pi}(x_k)$ obtained at any $t\in\mathbb{N}$ results in a stabilizing control policy. This allows us to stop the iteration at any $t$ without affecting the stability of the closed-loop system.

As shown in Section III-B, the update of $V_t^{\pi}(x_k)$ is determined by the weight vector $W$. Denote the weight vector obtained in the $t$th iteration by $W_t$ (which has converged for all the training data in this iteration) and select a small constant $\varepsilon>0$ as the threshold of change. Instead of iterating infinitely many steps, if $\|W_{t+1}-W_t\|\le\varepsilon$ is satisfied for some $t$, then $V_t^{\pi}(x_k)$ can be regarded as having converged to a neighborhood of $V^{*}(x_k)$, and the iteration can be stopped.
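The stopping rule can be sketched as follows. This is a hedged illustration, not the paper's updates (16)–(17): the value function is approximated as $V(x)\approx W^{T}\Phi(x)$ with the nine polynomial features of the linear example in Section VI-A, $W$ is refit by plain least squares on a toy quadratic target, and the iteration-termination test $\|W_{t+1}-W_t\|\le\varepsilon$ is evaluated. The target matrix below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # the 9 polynomial features used in the linear example of Section VI-A
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1*x2,
                     x1**3, x1**2*x2, x2**3, x1*x2**2])

def fit_weights(samples, targets):
    # least-squares fit of W in V(x) ~ W' phi(x); the paper instead uses
    # the gradient updates (16)-(17) with step size alpha
    Phi = np.stack([phi(x) for x in samples])          # (q, p)
    W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return W

# toy target: a quadratic value x'Px, exactly representable in the basis
P = np.array([[1.4, 0.3], [0.3, 5.0]])                 # hypothetical matrix
samples = rng.uniform(-2, 2, size=(100, 2))
targets = np.array([x @ P @ x for x in samples])

eps = 1e-8
W_prev = np.zeros(9)                                   # W_0 = 0
W = fit_weights(samples, targets)
converged = np.linalg.norm(W - W_prev) <= eps          # test ||W_{t+1}-W_t|| <= eps
```

Since the target is exactly representable, the fit recovers the quadratic coefficients ($W_3\approx1.4$, $W_4\approx5.0$, $W_5\approx0.6$ in 1-based indexing) with near-zero cubic terms, while the first iteration correctly reports `converged == False` because $W$ has moved away from $W_0=0$.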

Remark 8:

In contrast to PI, value iteration (VI) is considered less computationally expensive because it reduces the policy evaluation step to a single iteration. Although it is possible to adapt RLMPC to the VI update scheme, the properties presented in Section IV may be lost. This is because the intermediate policies of VI are not guaranteed to be stable, and the intermediate value functions may not correspond to any policy. With such value functions as terminal costs, it is difficult to ensure that the corresponding RLMPC generates suitable policies for the value function updates. Ensuring convergence, stability, and feasibility in the VI case is left for future research.

B. Overall Algorithm of the RLMPC Scheme

Given the state $x_k$, the policy $\pi_0(x_k)$ is used to control the real system for $r_0\ge1$ steps. Meanwhile, the policy evaluation for $\pi_0(\cdot)$ is performed as presented in Section III-B. Once the value function $V_1^{\pi}(\cdot)$ is obtained, it is substituted into OP 2 as the terminal cost, and the policy is updated to $\pi_1(\cdot)$. This procedure is repeated for $t\ge1$, with $\pi_t(\cdot)$ applied to the real system for $r_t$ steps ($r_t\ge1$ is a constant for a given $t$). This learning procedure is characterized by the fact that the policy is continually improved throughout the control process.

The pseudo-code of the RLMPC algorithm is shown in Algorithm 1, which is an example of updating the policy at every control step, that is, $r_t\equiv1$, $\forall t\in\mathbb{N}$.

Algorithm 1 RLMPC Algorithm

Initialize: $N$, $Q$, $R$, $\alpha$, $\varepsilon$, $p$, $\Phi(x_k)$, $W_0=0$, $t=0$, $\mathrm{Flag}=0$;

1: for $k=1,2,3,\ldots$ do
2:  Measure the current state $x_k$ at time $k$;
3:  Solve OP 2 for $x_k$ with terminal cost $W_t^{T}\Phi(x_{N|k})$;
4:  Apply $\pi_t(x_k)=u_{0|k}^{t}$ to system (1);
5:  if $\mathrm{Flag}=0$ then
6:   $\tilde W^{0}\leftarrow W_t$;
7:   Construct the sample set $S_k=\{x_{t_k}^{1},\ldots,x_{t_k}^{q}\}\subset X$;
8:   for $j=1,2,\ldots,q$ do
9:    Solve OP 2 for $x_{t_k}^{j}$ with terminal cost $W_t^{T}\Phi(\cdot)$;
10:    Calculate $J(x_{t_k}^{j},\mathbf{u}_{t_k}^{j})$ according to (17);
11:    $\tilde W^{j}\leftarrow\tilde W^{j-1}$;
12:    Update $\tilde W^{j}$ using (16) until $\|\Delta\tilde W^{j}\|\le\varepsilon$;
13:    if $\|\tilde W^{j}-\tilde W^{j-1}\|\le\varepsilon$ then break;
14:    end if
15:   end for
16:   $W_{t+1}\leftarrow\tilde W^{j}$;
17:  end if
18:  if $\|W_{t+1}-W_t\|\le\varepsilon$ then
19:   $\mathrm{Flag}\leftarrow1$;
20:  end if
21:  $t\leftarrow t+1$;
22: end for

SECTION VI.

Simulation Examples

To evaluate the effectiveness of the proposed RLMPC scheme, two examples are studied: one involving a linear system and the other a nonlinear one. Traditional MPC (the scheme described in OP 1) attains linear quadratic optimal performance for linear systems with properly designed terminal conditions, so the linear example serves as a benchmark for RLMPC performance. For nonlinear systems, however, traditional MPC achieves only suboptimal control performance, which is where RLMPC comes in; we highlight the superiority of RLMPC in the nonlinear example. In both examples, we run RLMPC (following Algorithm 1) for two episodes. The first episode (labeled “RLMPC Epi. 1”), called the learning episode, learns from scratch, starting from the initial policy; the second episode (labeled “RLMPC Epi. 2”), referred to as the re-execution episode, redoes the task under the learned value function (LVF) to check the performance.

A. Linear System

Consider the following linear system:

$$x_{k+1}=Ax_k+Bu_k\tag{34}$$
where $x_k=[x_{1k},x_{2k}]^{T}\in\mathbb{R}^{2}$, $u_k\in\mathbb{R}$, $A=\begin{bmatrix}1&0.1\\0.5&0.9\end{bmatrix}$, and $B=\begin{bmatrix}1\\0\end{bmatrix}$. We set the running cost to $\ell(x_k,u_k)=x_k^{T}Qx_k+u_k^{T}Ru_k$ with weights $Q=I\in\mathbb{R}^{2\times2}$ and $R=0.5$. Then, following the traditional MPC design procedure [9], we solve a Riccati equation and obtain $P=\begin{bmatrix}1.4388&0.3409\\0.3409&5.0793\end{bmatrix}$, which yields the terminal cost $F(x_{N|k})=x_{N|k}^{T}Px_{N|k}$ and the terminal controller $u_k=Kx_k$ with $K=[0.7597\;\;0.2128]$, the linear quadratic optimal gain. Correspondingly, we set the RLMPC parameters to $\alpha=10^{-6}$, $\varepsilon=10^{-8}$, $p=9$, and $\Phi(x)=[x_1,x_2,x_1^{2},x_2^{2},x_1x_2,x_1^{3},x_1^{2}x_2,x_2^{3},x_1x_2^{2}]^{T}$. Let the prediction horizon be $N=3$ and the initial state be $x_0=[2.9,2]^{T}$. In each iteration, we collect 100 random samples from the state space $x_{1k}\in[0.5,3.5]$, $x_{2k}\in[0.5,2.5]$, together with the actual system trajectory, to form $S_k$. The task is to drive the system states from $x_0$ to the origin while minimizing (3). We examine the performance of RLMPC and compare it with those of traditional MPC and the MPCWTC (labeled “LQR (MPC)” and “MPCWTC,” respectively, in the following figures). To facilitate comparison with the infinite-horizon optimal control performance, we impose no constraints on this control problem, in which case the above-designed traditional MPC is fully equivalent to the LQR.

The relevant state trajectories and control inputs are reported in Fig. 2. The trajectories of “RLMPC Epi. 1” and “MPCWTC” are very close in the first few steps because the initial policy of the learning episode is essentially the MPCWTC. As the policy of the learning episode improves with each control step $k$, its trajectory gradually deviates from that of “MPCWTC” and converges to that of “LQR (MPC).” The evolution of the weight vector $W$ in the learning episode is shown in Fig. 3. We observe that $W$ converges at the 11th step (at which moment the system states have not yet been regulated to the origin), indicating the convergence of the learning process. Since Algorithm 1 performs one iteration at each step, the learning process converges at the 11th iteration. To examine optimality, we compare the performance of “LQR (MPC)” with that of “RLMPC Epi. 2.” Their trajectories essentially coincide, up to a small error that mainly depends on the selection of $\varepsilon$ for terminating the iterations, which verifies the optimality of RLMPC.

Fig. 2. Trajectories and inputs (linear system).

Fig. 3. Convergence of the weight vector (linear system).

Fig. 4 illustrates the accumulated cost (ACC), defined as the sum of the running costs from the initial time to the current time $k$:
$$\mathrm{ACC}=\sum_{i=0}^{k}\ell(x_i,u_i).\tag{35}$$

Fig. 4. ACCs (linear system).

As expected, the ACC of “MPCWTC” is the highest because it only looks $N$ steps ahead and ignores all residual costs in each control period. The ACC of “RLMPC Epi. 1” ranks second highest, as its policy gradually improves from that of “MPCWTC.” “RLMPC Epi. 2” achieves almost the same minimum as “LQR (MPC)”; the slightly higher ACC (by about $4\times10^{-5}$) is mainly due to the precision setting for terminating the iterations. In conclusion, the proposed RLMPC approach delivers a (near-)optimal policy in the infinite-horizon sense for the control of a linear system.

To verify the statements in Remark 6, we first test the equivalence of RLMPC and the MPCWTC with an accumulating prediction horizon. Since the prediction horizon of RLMPC is set to $N=3$, following Algorithm 1 and the results in Section IV-B, it is theoretically equivalent to the MPCWTC with $N=3k$, where $k\in\mathbb{N}_{>0}$ is the step index. The two subgraphs in the upper part of Fig. 5 show that their inputs, and hence their trajectories, are essentially the same, which confirms this equivalence. We then investigate the effect of increasing the prediction horizon by comparing the performance of RLMPC in the learning episode with $N=3,5,7,9,11$, respectively. As can be seen from the lower subgraph of Fig. 5, the larger the prediction horizon, the closer the corresponding trajectory is to the optimal trajectory “LQR (MPC).” This is explained by the fact that the equivalent MPCWTC horizon grows by a larger increment in every iteration, so the policy at each step is closer to the optimal one. Furthermore, we present the ACCs and the CRs (measured by the number of iterations required for convergence) in Table I. A larger prediction horizon clearly brings a lower ACC and faster convergence in the learning episode, which is consistent with our theoretical results.

TABLE I. ACCs and CRs of RLMPC in the Learning Episode Under Different Prediction Horizons
Fig. 5. Trajectories of RLMPC in the learning episode under different prediction horizons (linear system).

We finally investigate the data efficiency of the proposed scheme and compare it with Q-learning [44] and PILCO [45]. The former employs the action-value function and has a learning process very similar to that of the proposed scheme, except that it is model-free; the latter is a model-based policy search scheme well known for its data efficiency. Implementing Q-learning and PILCO for the optimal control of (34), we observe that the amounts of training data required for them to converge to the optimal policy are $1.3\times10^{4}$ and 400 (10 trials with 40 training data in each trial), respectively. In contrast, RLMPC needs only 341 training data, far less than Q-learning, achieving efficiency comparable to that of PILCO. This is mainly because Q-learning uses no prior knowledge of the system and therefore requires a large amount of interaction data to provide a basis for policy improvement, whereas PILCO and RLMPC optimize the policy with the assistance of the model, significantly reducing the number of interactions and accelerating convergence. In addition, owing to the combination with MPC, the significant advantage of RLMPC over PILCO is the theoretical guarantee of stability and feasibility for the policies generated throughout the learning process.

B. Nonlinear System

Consider the regulation problem of a nonholonomic vehicle system

$$x_{k+1}=x_k+g(x_k)u_k\tag{36}$$
where $x_k=[\chi_k,y_k,\theta_k]^{T}$ is the state vector, with $\chi_k$, $y_k$, and $\theta_k$ representing the abscissa, ordinate, and yaw angle in the geodetic coordinate system, respectively; $u_k=[v_k,\omega_k]^{T}$ is the control input, with $v_k$ and $\omega_k$ representing the linear and angular velocities, respectively; $\delta$ is the sampling interval; and $g(x_k)=\delta\begin{bmatrix}\cos\theta_k&0\\\sin\theta_k&0\\0&1\end{bmatrix}$. We set $\delta=0.2$ s, the input constraints $|v_k|\le1$ m/s and $|\omega_k|\le4$ rad/s, and the state constraint $0\le\chi\le2$ m. The control objective is to drive the system from $x_0=[1.98,5,\pi/3]^{T}$ to the origin while minimizing (3) with the quadratic running cost $\ell(x_k,u_k)=x_k^{T}Qx_k+u_k^{T}Ru_k$, where $Q=\mathrm{diag}\{1,2,0.06\}$ and $R=\mathrm{diag}\{0.01,0.005\}$.
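A one-step simulator of model (36) with the stated sampling interval and input bounds can be written directly from the definitions above. This is a sketch of the plant only: the clipping illustrates the input constraints, while the state constraint $0\le\chi\le2$ and the cost are handled by the MPC optimization, which is not reproduced here.

```python
import numpy as np

DELTA = 0.2            # sampling interval delta [s]
V_MAX, W_MAX = 1.0, 4.0  # |v_k| <= 1 m/s, |omega_k| <= 4 rad/s

def step(x, u):
    # x = [chi, y, theta]; u = [v, omega]; returns x_{k+1} = x_k + g(x_k) u_k
    chi, y, theta = x
    v = np.clip(u[0], -V_MAX, V_MAX)
    w = np.clip(u[1], -W_MAX, W_MAX)
    g = DELTA * np.array([[np.cos(theta), 0.0],
                          [np.sin(theta), 0.0],
                          [0.0,           1.0]])
    return x + g @ np.array([v, w])

x0 = np.array([1.98, 5.0, np.pi / 3])
x1 = step(x0, np.array([-0.5, -1.0]))
# chi advances by delta * v * cos(theta) = 0.2 * (-0.5) * 0.5 = -0.05,
# so chi goes from 1.98 to 1.93, staying inside the band [0, 2]
```

Note that the input matrix $g(x_k)$ has rank 2 while the state is three-dimensional, which is the structural reason (cf. Remark 9) why no smooth time-invariant feedback can stabilize the origin.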

Remark 9:

According to Brockett's theorem [46], there exists no smooth, time-invariant feedback law that stabilizes system (36). This renders the widely adopted design methods for terminal conditions, such as that of [10], inapplicable to this system, posing a great challenge to the traditional MPC design. Even if qualified terminal conditions can be found, they tend to be quite conservative, as evidenced by a small terminal set, a large required prediction horizon, and an overestimated terminal cost, which can lead to conservative performance of the designed traditional MPC.

1) Design of the Traditional MPC and RLMPC:

In [47], a traditional MPC is proposed for (36): the terminal cost is $F(x_{N|k})=\chi_{N|k}^{2}+y_{N|k}^{2}+\theta_{N|k}^{2}$, and the terminal set $X_{\Omega}$ is defined as $X_{\Omega}=\{x_{N|k}\in X: F(x_{N|k})\le\varrho_v\}$, where $\varrho_v=c\varrho$, $\varrho=\min\{\bar v^{2}/\eta^{2},\bar\omega^{2}/\xi^{2}\}$, $c=\big(1+T^{2}\max\{\eta^{2},\xi^{2}\}\max\{\eta T^{2}+q_{11}/\eta+r_{11}\eta,\,2\xi T\}\big)^{-1}<1$, $\bar v$ and $\bar\omega$ are the upper bounds of $v$ and $\omega$, respectively (that is, $\bar v=1$ and $\bar\omega=4$), and $q_{11}$ and $r_{11}$ are the first diagonal elements of $Q$ and $R$, respectively. By choosing $\eta=\xi=8$ and $N\ge30$, system (36) can be stabilized to the origin.

Remark 10:

This traditional MPC design is computationally demanding because the prediction horizon should be at least 30 to meet the terminal constraint, which confirms Remark 9. Moreover, there is no guarantee that the control performance of the designed MPC is optimal. In contrast, RLMPC avoids artificially designing the terminal conditions and achieves (near-)optimal performance. With the learned (near-)optimal terminal cost, its prediction horizon can be flexibly adjusted without affecting its performance. Therefore, it is expected to show unique superiority in this problem.

Now we present the RLMPC design: $\alpha=10^{-6}$, $\varepsilon=10^{-7}$, $p=30$, and the basis functions are chosen as polynomials of orders one to six. We generate 100 random samples in each iteration from the state space $\chi_k\in[0,2]$, $y_k\in[1,6]$, $\theta_k\in[-\pi,\pi]$, together with the actual system trajectory, to form $S_k$.

2) Comparison of the Traditional MPC and RLMPC:

In the following simulations, the prediction horizons of the traditional MPC and RLMPC are set to 30 and 5 steps, respectively.

As we can see from Figs. 6 and 7, no constraint is violated in either episode, which demonstrates that RLMPC restricts the policy within the input and state constraints and thus operates safely. Fig. 8 illustrates the convergence of the weight vector $W$; it takes 25 iterations to converge, and the evolution of the LVF is also visualized. Consistent with Theorem 1, this evolution exhibits the uniformly monotonically nondecreasing property. To verify Theorem 4, we apply the policies obtained in the first 25 iterations of the learning episode to the system and show the corresponding trajectories and ACCs [as defined in (35)] in Fig. 9. Each policy generates a safe and stabilizing trajectory to the origin, and the ACCs improve gradually, which also corroborates Theorem 1. The comparison of the two RLMPC episodes and traditional MPC is shown in Fig. 7. Since the traditional MPC design for nonlinear systems does not ensure optimality, RLMPC achieves better performance (15.09% lower ACC) even in the learning episode. There is a further 16% reduction in ACC in the re-execution episode, where a near-optimal policy is employed.

Fig. 6. Inputs of RLMPC (nonlinear system).

Fig. 7. System trajectories and ACCs of traditional MPC and RLMPC (nonlinear system).

Fig. 8. Convergence of the weight vector and LVF (nonlinear system).

Fig. 9. System trajectories and ACCs of RLMPC policies in the 1st–25th iterations of the learning episode (nonlinear system).

3) Prediction Horizon and Computational Burden:

We now examine the relationship between the prediction horizon, control performance, convergence, and computational burden, using three metrics for quantitative evaluation: the ACC, the CR, and the average computational time (ACT) per step. RLMPC in the learning episode is compared under four different prediction horizons, namely $N=5,7,9,30$ (all other parameter settings unchanged), as well as against traditional MPC with $N=30$. The corresponding numerical results are listed in Table II.

TABLE II. Comparison of Traditional MPC and RLMPC in the Learning Episode With Different Prediction Horizons

Consistent with the linear system example, increasing the prediction horizon of RLMPC effectively improves the CR and yields an improved (lower-ACC) trajectory in the learning episode. Conceivably, this comes at the cost of increased computation time per step. However, even with the same prediction horizon of 30, the ACT of RLMPC is still slightly lower than that of traditional MPC, mainly owing to the removal of the terminal constraint, which simplifies the optimization problem.

We now verify Remark 7. Taking the learned (near-)optimal value function as the terminal cost of RLMPC, we reduce the prediction horizon to 1 and re-execute the regulation task. As shown in Fig. 10, the trajectories and inputs under $N=1$ basically coincide with those under $N=5$, confirming that shortening the prediction horizon does not affect the (near-)optimal performance. More importantly, the ACT is reduced from 0.0826 to 0.0222 s (an improvement of about 73%), which illustrates the effectiveness of RLMPC in reducing the computational burden.
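The horizon-insensitivity exploited here can be illustrated in the linear-quadratic setting (a stand-in for the paper's nonlinear example, assuming quadratic value functions and the matrices of Section VI-A): once the terminal cost equals the optimal value function $x^{T}P^{*}x$, the first-step MPC feedback gain is identical for every prediction horizon, so $N$ can be shortened to 1 without losing optimality.

```python
import numpy as np

# Linear example from Section VI-A
A = np.array([[1.0, 0.1], [0.5, 0.9]])
B = np.array([[1.0], [0.0]])
Q, R = np.eye(2), np.array([[0.5]])

def riccati_step(P):
    # one backward DP step; returns the updated cost matrix and feedback gain
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ A - A.T @ P @ B @ K, K

# optimal value function P*: fixed point of the recursion (found by iteration)
P_star = np.zeros((2, 2))
for _ in range(500):
    P_star, _ = riccati_step(P_star)

def mpc_gain(P_term, N):
    # first-step gain of an N-step MPC with terminal cost x' P_term x
    P, K = P_term, None
    for _ in range(N):
        P, K = riccati_step(P)
    return K

# with terminal cost P*, horizons 1 and 5 yield exactly the same policy
assert np.allclose(mpc_gain(P_star, 1), mpc_gain(P_star, 5))
```

With a zero terminal cost, by contrast, the first-step gain depends strongly on the horizon, which is exactly the gap that learning the terminal cost closes.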

Fig. 10. Comparison of trajectories and inputs of RLMPC with the learned (near-)optimal value function under prediction horizons $N=1$ and $N=5$ (nonlinear system).

SECTION VII.

Conclusion

To eliminate the offline design of the terminal cost, terminal controller, and terminal set in traditional MPC and to improve closed-loop performance, this article developed a new RLMPC scheme for discrete-time systems that combines RL and MPC through PI. The core of the scheme is to take a value function, obtained through learning, as the terminal cost of traditional MPC to stabilize the system. We showed that the value function converges monotonically to the optimal one under PI, and, based on this property, the closed-loop stability of RLMPC was further proved. We tested the proposed scheme on a linear system example to show its convergence, safety, optimality, and stability, and implemented it on a nonholonomic vehicle system to show that RLMPC outperforms traditional MPC in the nonlinear case. Finally, we discussed the impact of the prediction horizon on the learning process and control performance of RLMPC, highlighting its ability to flexibly adjust the prediction horizon to reduce the computational burden.

    References

    1.
    T. Wang, H. Gao and J. Qiu, "A combined adaptive neural network and nonlinear model predictive control for multirate networked industrial process control", IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 2, pp. 416-425, Feb. 2016.
    2.
    M. L. Darby and M. Nikolaou, "MPC: Current practice and challenges", Control Eng. Pract., vol. 20, no. 4, pp. 328-342, Apr. 2012.
    3.
    M. Neunert et al., "Whole-body nonlinear model predictive control through contacts for quadrupeds", IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 1458-1465, Jul. 2018.
    4.
    Z. Li, Y. Xia, C.-Y. Su, J. Deng, J. Fu and W. He, "Missile guidance law based on robust model predictive control using neural-network optimization", IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 8, pp. 1803-1809, Aug. 2015.
    5.
    J. Kabzan et al., "AMZ driverless: The full autonomous racing system", J. Field Robot., vol. 37, no. 7, pp. 1267-1294, 2020.
    6.
    Z. Li, W. Yuan, Y. Chen, F. Ke, X. Chu and C. L. P. Chen, "Neural-dynamic optimization-based model predictive control for tracking and formation of nonholonomic multirobot systems", IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 12, pp. 6113-6122, Dec. 2018.
    7.
    C. C. Chen and L. Shaw, "On receding horizon feedback control", Automatica, vol. 18, no. 3, pp. 349-352, May 1982.
    8.
    M. Alamir and G. Bornard, "On the stability of receding horizon control of nonlinear discrete-time systems", Syst. Control Lett., vol. 23, no. 4, pp. 291-296, Oct. 1994.
    9.
    J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design, Madison, WI, USA:Nob Hill Pub, 2009.
    10.
    H. Chen and F. Allgöwer, "A quasi-infinite horizon nonlinear model predictive control scheme with guaranteed stability", Automatica, vol. 34, no. 10, pp. 1205-1217, 1998.
    11.
    D. Q. Mayne, J. B. Rawlings, C. V. Rao and P. O. M. Scokaert, "Constrained model predictive control: Stability and optimality", Automatica, vol. 36, no. 6, pp. 789-814, Jun. 2000.
    12.
    M. Reble, Model Predictive Control for Nonlinear Continuous-Time Systems With and Without Time-delays, Berlin, Germany:Logos Verlag Berlin GmbH, 2013.
    13.
    A. Boccia, L. Grüne and K. Worthmann, "Stability and feasibility of state constrained MPC without stabilizing terminal constraints", Syst. Control Lett., vol. 72, pp. 14-21, Oct. 2014.
    14.
    D. Silver et al., "Mastering the game of go with deep neural networks and tree search", Nature, vol. 529, no. 7587, pp. 484-489, Jan. 2016.
    15.
    V. Mnih et al., "Playing Atari with deep reinforcement learning", arXiv:1312.5602, 2013.
    16.
    H. Li, Q. Zhang and D. Zhao, "Deep reinforcement learning-based automatic exploration for navigation in unknown environment", IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 6, pp. 2064-2076, Jun. 2020.
    17.
    R. Kamalapurkar, L. Andrews, P. Walters and W. E. Dixon, "Model-based reinforcement learning for infinite-horizon approximate optimal tracking", IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 3, pp. 753-758, Mar. 2017.
    18.
    Y. Li, "Reinforcement learning applications", arXiv:1908.06973, 2019.
    19.
    M. Wiering and M. V. Otterlo, Reinforcement Learning, Berlin, Germany:Springer, vol. 12, 2012.
    20.
    R. S. Sutton, "Learning to predict by the methods of temporal differences", Mach. Learn., vol. 3, no. 1, pp. 9-44, Aug. 1988.
    21.
    G. Tesauro, "Practical issues in temporal difference learning", Mach. Learn., vol. 8, no. 3, pp. 257-277, May 1992.
    22.
    J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation", IEEE Trans. Autom. Control, vol. 42, no. 5, pp. 674-690, May 1997.
    23.
    J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning", J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437-1480, 2015.
    24.
    F. Berkenkamp, M. Turchetta, A. Schoellig and A. Krause, "Safe model-based reinforcement learning with stability guarantees", Proc. Adv. Neural Inf. Process. Syst., pp. 908-918, 2017.
    25.
    M. Gallieri et al., "Safe interactive model-based learning", arXiv:1911.06556, 2019.
    26.
    K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov and I. Mordatch, "Plan online learn offline: Efficient learning and exploration via model-based control", arXiv:1811.01848, 2018.
    27.
    F. Farshidian, D. Hoeller and M. Hutter, "Deep value model predictive control", arXiv:1910.03358, 2019.
    28.
    K. Napat, M. I. Valls, D. Hoeller and M. Hutter, "Practical reinforcement learning for MPC: Learning from sparse objectives in under an hour on a real robot", Proc. Annu. Conf. Learn. Dyn. Control, pp. 211-224, 2020.
    29.
    S. Gros and M. Zanon, "Data-driven economic NMPC using reinforcement learning", IEEE Trans. Autom. Control, vol. 65, no. 2, pp. 636-648, Feb. 2020.
    30.
    M. Zanon, S. Gros and A. Bemporad, "Practical reinforcement learning of stabilizing economic MPC", Proc. 18th Eur. Control Conf. (ECC), pp. 2258-2263, Jun. 2019.